[PyTorch][Core][JAX] Expand troubleshooting docs #2602

Open
jberchtold-nvidia wants to merge 5 commits into NVIDIA:main from jberchtold-nvidia:jberchtold/expand-troubleshooting-installation-docs
@jberchtold-nvidia jberchtold-nvidia commented Jan 15, 2026

Description

Expand the troubleshooting installation docs with a few recently debugged issues.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Expand the troubleshooting docs with recently debugged issues, including problems with uv virtual environments and JAX-specific symptoms.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

greptile-apps bot commented Jan 15, 2026

Greptile Overview

Greptile Summary

  • Extends the README troubleshooting section with guidance for uv/virtualenv installation pitfalls (import errors, cuDNN sublibrary loading failures, wheel build/runtime mismatches).
  • Adds a JAX-focused troubleshooting entry for missing CUDA custom-call registrations (FFI).
  • Keeps changes localized to documentation near the existing troubleshooting marker so downstream docs tooling can still anchor the section.

Confidence Score: 4/5

  • This PR is largely safe to merge as a docs-only change once the remaining README correctness/formatting issues are resolved.
  • Changes are limited to README troubleshooting text, so blast radius is low. However, there are still concrete docs correctness/formatting issues in the new section (one already flagged in prior review threads and additional RST list/heading formatting that can break rendering), which should be fixed before merge.
  • README.rst (new troubleshooting section formatting/consistency)

Important Files Changed

| Filename | Overview |
|---|---|
| README.rst | Adds troubleshooting subsections for uv/virtualenv installs and JAX FFI errors; main issues are RST formatting (extra blank lines before code-block) and a contradictory build-isolation statement already noted in prior threads. |

Sequence Diagram

sequenceDiagram
  participant User
  participant UV as uv/venv
  participant Pip as pip/uv pip
  participant Build as TE build (PEP517)
  participant TE as transformer_engine
  participant JAX

  User->>UV: Activate virtual environment
  User->>Pip: Install TE (uv pip install --no-build-isolation ...)
  Pip->>Build: Build TE without isolation
  Build-->>Pip: Wheel / install artifacts
  Pip-->>TE: Importable package in venv

  User->>TE: Run workload
  alt cuDNN sublibrary loading failed
    TE->>TE: dlopen cuDNN libs
    TE-->>User: CUDNN_STATUS_SUBLIBRARY_LOADING_FAILED
    User->>UV: Ensure venv cuDNN packages used
    User->>Build: Set CUDNN_PATH/CUDNN_HOME/LD_LIBRARY_PATH
  end

  alt JAX FFI not registered
    TE->>JAX: Register custom calls during init
    JAX-->>User: No registered implementation for custom call (CUDA)
    User->>Pip: Reinstall/build with --no-build-isolation
  end

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 2 comments

jberchtold-nvidia and others added 2 commits January 15, 2026 09:53
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 2 comments


.. code-block:: bash

    export CUDNN_PATH=$(pwd)/.venv/lib/python3.12/site-packages/nvidia/cudnn

style: hardcoded Python version may not work for all users - consider using a generic placeholder like pythonX.Y or explaining users should adjust this

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
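
One way to avoid the hardcoded `python3.12` segment is to ask the interpreter in the active environment where its site-packages directory lives. The snippet below is a minimal sketch, not the wording adopted in the README. It assumes the virtual environment is already activated, that `python3` resolves to the venv interpreter, and that cuDNN was installed from a pip wheel that places its files under `nvidia/cudnn`. The variable name `SITE_PACKAGES` is introduced here only for illustration.

```shell
# Ask the active interpreter for its site-packages directory
# instead of hardcoding a pythonX.Y path segment.
SITE_PACKAGES="$(python3 -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')"

# Assumption: the cuDNN pip wheel installs under <site-packages>/nvidia/cudnn.
export CUDNN_PATH="${SITE_PACKAGES}/nvidia/cudnn"

# Show the resolved value so it can be checked before building.
echo "CUDNN_PATH=${CUDNN_PATH}"
```

Because the path comes from the running interpreter, the same lines should work across Python minor versions and venv locations, provided the assumptions above hold.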

Comment on lines +318 to +319



style: extra blank line - RST should have only one blank line before code blocks (see lines 305-306 for consistent formatting)

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
@jberchtold-nvidia
Collaborator Author

/te-ci

@jberchtold-nvidia jberchtold-nvidia added the documentation Improvements or additions to documentation label Jan 15, 2026
@pggPL pggPL self-requested a review January 19, 2026 22:13
@jberchtold-nvidia
Collaborator Author

/te-ci

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 2 comments

@@ -315,6 +315,37 @@ Troubleshooting
cd transformer_engine
pip install -v -v -v --no-build-isolation .


RST list nesting broken
The new troubleshooting section uses `1.`/`2.` numbered items with nested bullet lines (`* **Symptoms:**` / `* **Solution:**`), but there’s no blank line separating the list item from the nested bullet list. In reStructuredText this often breaks nesting/formatting (the `*` bullets can get treated as literal text or start a new top-level list). Add a blank line after each numbered item title (e.g., after `1. **Import Error:**`) before the indented `*` bullets, and likewise for the JAX section.

Also appears at README.rst:325, README.rst:338, and README.rst:346 (same pattern).
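
The pattern described above can be spotted mechanically. The following is a rough sketch of such a check, assuming numbered items start at column 0 as `N. ` and nested bullets are space-indented `* ` lines. The function name `check_rst_nesting` is hypothetical and not part of the repository or any RST tooling, and it only catches this one pattern.

```shell
# Hypothetical helper: report numbered list items that are immediately
# followed by an indented bullet with no blank line in between.
check_rst_nesting() {
  awk '
    # Previous line is a numbered item and the current line is an indented bullet.
    prev ~ /^[0-9]+\. / && /^ +\* / {
      printf "%d: add a blank line before this nested bullet\n", NR
      bad = 1
    }
    { prev = $0 }
    # Exit non-zero if at least one offending line was found.
    END { exit bad + 0 }
  ' "$1"
}
```

Running `check_rst_nesting README.rst` prints the line number of each offending bullet and returns a non-zero status when any are found, so it could be used as a quick local check before pushing.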

Comment on lines +331 to +333
.. code-block:: bash

    export CUDNN_PATH=$(pwd)/.venv/lib/python3.12/site-packages/nvidia/cudnn

Extra blank lines
There are two blank lines before the .. code-block:: bash directive. In RST, extra blank lines inside list items can cause the directive to detach from the list item and/or render with unexpected spacing. Reduce to a single blank line before the directive so it stays correctly nested under the Solution: bullet.
